

Search for: All records

Creators/Authors contains: "Cooper, Erica"


  1. While large TTS corpora exist for commercial systems created for high-resource languages such as Mandarin, English, and Spanish, for many languages such as Amharic, which are spoken by millions of people, this is not the case. We are working with “found” data collected for other purposes (e.g. training ASR systems) or available on the web (e.g. news broadcasts, audiobooks) to produce TTS systems for low-resource languages which do not currently have expensive, commercial systems. This study describes TTS systems built for Amharic from “found” data and includes systems built from different acoustic-prosodic subsets of the data, systems built from combined high and lower quality data using adaptation, and systems which use prediction of Amharic gemination to improve naturalness as perceived by evaluators.
  2. We compare two approaches for training statistical parametric voices that make use of acoustic and prosodic features at the utterance level with the aim of improving naturalness of the resultant voices -- subset adaptation, and adding new acoustic and prosodic features at the frontend. We have found that the approach of labeling high, middle, or low values for a given feature at the frontend and then choosing which setting to use at synthesis time can produce voices rated as significantly more natural than a baseline voice that uses only the standard contextual frontend features, for both HMM-based and neural network-based synthesis. 
  3. We compare two approaches for training statistical parametric voices that make use of acoustic and prosodic features at the utterance level with the aim of improving naturalness of the resultant voices – subset adaptation, and adding new acoustic and prosodic features at the frontend. We have found that the approach of labeling high, middle, or low values for a given feature at the frontend and then choosing which setting to use at synthesis time can produce voices rated as significantly more natural than a baseline voice that uses only the standard contextual frontend features, for both HMM-based and neural network-based synthesis.
  4. Extensive TTS corpora exist for commercial systems created for high-resource languages such as Mandarin, English, and Japanese. Speakers recorded for these corpora are typically instructed to maintain constant f0, energy, and speaking rate and are recorded in ideal acoustic environments, producing clean, consistent audio. We have been developing TTS systems from “found” data collected for other purposes (e.g. training ASR systems) or available on the web (e.g. news broadcasts, audiobooks) to produce TTS systems for low-resource languages (LRLs) which do not currently have expensive, commercial systems. This study investigates whether traditional TTS speakers do exhibit significantly less variation and better speaking characteristics than speakers in found genres. By examining characteristics of f0, energy, speaking rate, articulation, NHR, jitter, and shimmer in found genres and comparing these to traditional TTS corpora, we have found that TTS recordings are indeed characterized by low mean pitch, standard deviation of energy, speaking rate, and level of articulation, and low mean and standard deviations of shimmer and NHR; in a number of respects these are quite similar to some found genres. By identifying similarities and differences, we are able to identify objective methods for selecting found data to build TTS systems for LRLs.